Spring R ’24
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
| # | Date | Title |
|---|---|---|
| 1 | 2/22 | Getting Started with Tabular Data |
| 2 | 2/29 | Working with Data Frames |
| 3 ⭐ | 3/7 | Data Visualization |
| 4 | 3/21 | Inference and Modeling Intro |
| 5 | 3/28 | Machine Learning Intro |
“ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”
# attach ggplot2, dplyr, tidyr, and forcats, which the examples below use:
library(tidyverse)
library(palmerpenguins)
# load palmerpenguins' data into your environment:
data(penguins)
names(penguins)
[1] "species"           "island"            "bill_length_mm"
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"
[7] "sex"               "year"
palmerpenguins is one of many examples of an R package which functions as a downloadable data set.
An aesthetic mapping, created with aes(), associates an aesthetic property of the plot with a variable in your data.
penguins example: I want a scatter plot of flipper length on x and bill length on y. The aesthetic mapping would be:
aes(x = flipper_length_mm, y = bill_length_mm)
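To see the mapping in use, here is a minimal sketch, assuming the tidyverse and palmerpenguins packages are installed. The object names flipper_bill and p are illustrative only. An aesthetic set inside aes() varies with the data, while one set outside aes() is a fixed constant.

```r
library(tidyverse)
library(palmerpenguins)

# store the aesthetic mapping; it names columns but does not evaluate them yet
flipper_bill <- aes(x = flipper_length_mm, y = bill_length_mm)

# color is set outside aes(), so every point gets the same fixed color
p <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(mapping = flipper_bill) +
  geom_point(color = "steelblue")

p
```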
Common aesthetics:
x, y
color, fill
linetype
size
shape (of points)
geom_histogram(mapping, binwidth) and geom_density(mapping)
A histogram “bins” a variable’s values and charts how many values are in each bin, giving a sense of central tendency and spread. A density plot works similarly, but it produces a smooth curve while sacrificing countable units.
# distribution of body mass:
penguins %>%
  ggplot() +
  geom_histogram(mapping = aes(x = body_mass_g))
# remove NA values, and specify bin width:
penguins %>%
  drop_na(body_mass_g) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = body_mass_g),
                 binwidth = 100)
# density plot:
penguins %>%
  ggplot() +
  geom_density(aes(x = body_mass_g))
# layering histogram and density plot
# note that 1) mapping is specified in ggplot(), and inherited by geoms;
# 2) histogram's y mapping needs adjusting to be compatible with density plot
penguins %>%
  drop_na(body_mass_g) %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(mapping = aes(y = after_stat(density)), binwidth = 100) +
  geom_density()
geom_violin(), geom_boxplot(). Mapping a categorical variable onto an axis will make a violin or box plot for each level of the variable.
A violin plot mirrors a density plot across the axis. A box plot marks the interquartile range (the box), the median (line inside the box), whiskers, and outliers (individual points), where an outlier lies more than \(1.5 \times \mathrm{IQR}\) beyond the edge of the box, that is, below the first quartile or above the third quartile.
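As a minimal sketch (assuming the tidyverse and palmerpenguins packages are installed; the names mass, p, and fences are illustrative), the following draws one violin per species with a narrow box plot on top. It then computes the \(1.5 \times \mathrm{IQR}\) fences, measured from the first and third quartiles, beyond which observations are drawn as individual points.

```r
library(tidyverse)
library(palmerpenguins)

mass <- penguins %>%
  drop_na(body_mass_g)

# one violin per species, with a narrow box plot layered on top
p <- mass %>%
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_boxplot(width = 0.15)
p

# fences: points outside [lower_fence, upper_fence] count as outliers
fences <- mass %>%
  group_by(species) %>%
  summarize(q1 = quantile(body_mass_g, 0.25),
            q3 = quantile(body_mass_g, 0.75),
            lower_fence = q1 - 1.5 * (q3 - q1),
            upper_fence = q3 + 1.5 * (q3 - q1))
fences
```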
geom_point() plots points
A scatter plot compares continuous variables \(x\) and \(y\) on a Cartesian plane. geom_point() can take several aesthetics (size, shape, color), but be careful not to overload your plot with information.
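A minimal sketch of a scatter plot with extra aesthetics, assuming the tidyverse and palmerpenguins packages are installed. Species is encoded twice here (color and shape) for redundancy, while size and alpha are fixed constants; the particular values are arbitrary choices.

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
  # color and shape vary with the data; size and alpha apply to every point
  geom_point(aes(color = species, shape = species),
             size = 2, alpha = 0.7)
p
```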
geom_smooth() adds a smoother to a scatter plot
A natural next step to a scatter plot is to fit a regression model. geom_smooth() does this job; specify the modeling function you want with the method argument. The default methods are loess, local polynomial regression fitting (when \(n \lt 1000\)), or gam, generalized additive model with restricted maximum likelihood (when \(n \geq 1000\)). See ?stats::loess and ?mgcv::gam for more info on these modeling functions.
(If you have fit your own model outside of ggplot, you should instead use predict() to generate points to plot with geom_line().)
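The predict() approach can be sketched as follows, assuming the tidyverse and palmerpenguins packages are installed. The simple linear model and the names measured, fit, and grid are illustrative choices, not part of the original slides.

```r
library(tidyverse)
library(palmerpenguins)

measured <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm)

# fit the model outside of ggplot
fit <- lm(bill_length_mm ~ flipper_length_mm, data = measured)

# a grid of x values spanning the observed range, plus predicted y values
grid <- tibble(
  flipper_length_mm = seq(min(measured$flipper_length_mm),
                          max(measured$flipper_length_mm),
                          length.out = 100)
)
grid$bill_length_mm <- predict(fit, newdata = grid)

# geom_line() inherits the x and y mapping but draws from the grid
p <- ggplot(measured, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  geom_line(data = grid, color = "blue")
p
```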
# using default loess:
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm)) +
  geom_point() +
  geom_smooth()
# using a linear model, lm():
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm")
# encode species as color:
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm,
             color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
geom_bar() makes bar charts of counts
penguins %>%
  ggplot(aes(x = species)) +
  geom_bar()
# note 2 variables, y axis, and dodge position
# we also reorder the factors for aesthetic reasons
# ...and note unexpected colors!
penguins %>%
  mutate(species = fct_rev(fct_infreq(species)),
         sex = fct_infreq(sex)) %>%
  ggplot(aes(y = species, fill = sex)) +
  geom_bar(position = position_dodge2(reverse = TRUE, preserve = "single"))
Perform summary calculations using dplyr, then use geom_bar() with stat = "identity" to plot the numbers as-is (the default is "count"). ⚠ When plotting a summary statistic, also show its uncertainty! Relevant geoms: geom_errorbar(), geom_errorbarh(), geom_linerange()
Example: mean body mass for each species, and then for each species and sex, with standard error
penguins %>%
  drop_na(body_mass_g) %>%
  group_by(species) %>%
  summarize(mean_mass_g = mean(body_mass_g),
            standard_error = sd(body_mass_g) / sqrt(n())
  ) %>%
  ggplot(aes(x = species, y = mean_mass_g)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymax = mean_mass_g + standard_error,
                    ymin = mean_mass_g - standard_error),
                width = 0.5, color = "blue")
# the same summary by species and sex:
# note 2 grouping variables, species on the y axis, and dodge position
penguins %>%
  drop_na(body_mass_g) %>%
  group_by(species, sex) %>%
  summarize(mean_mass_g = mean(body_mass_g),
            standard_error = sd(body_mass_g) / sqrt(n())
  ) %>%
  ggplot(aes(y = species, fill = sex, x = mean_mass_g)) +
  geom_bar(stat = "identity",
           position = position_dodge2()) +
  geom_errorbarh(aes(xmin = mean_mass_g - standard_error,
                     xmax = mean_mass_g + standard_error),
                 position = position_dodge2())
Sometimes we want to compare proportions of category membership. With position = "fill", each bar has the same height, and the fill aesthetic shows each category's proportion of the bar.
A pie chart in ggplot2 terms is a proportional bar cast onto a polar coordinate system.
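Both ideas can be sketched briefly, assuming the tidyverse and palmerpenguins packages are installed. The first plot uses position = "fill" to show the share of each species on each island; the second bends a single proportional bar around a polar axis with coord_polar() to make a pie chart. The names p_fill and p_pie are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# proportional bars: share of each species on each island
p_fill <- penguins %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
p_fill

# pie chart: one proportional bar, with its y axis mapped to the angle
p_pie <- penguins %>%
  ggplot(aes(x = "", fill = species)) +
  geom_bar(position = "fill", width = 1) +
  coord_polar(theta = "y")
p_pie
```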
The two-way table or contingency table format shows the joint distribution of observations across two categorical variables. In penguin terms, how many penguins of each species were observed on each of three islands?
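One way to build such a table is sketched below, assuming the tidyverse and palmerpenguins packages are installed. It counts each species and island pair, then spreads the islands into columns; two_way is an illustrative name. Base R's table() gives an equivalent result.

```r
library(tidyverse)
library(palmerpenguins)

# count each species/island pair, then spread islands into columns;
# pairs that were never observed are filled with 0
two_way <- penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = island, values_from = n, values_fill = 0)
two_way

# base R equivalent
table(penguins$species, penguins$island)
```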
A graphical version of the previous table is given by geom_tile(). However, you will need to group and summarize the data first.
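A minimal sketch of the tile version, assuming the tidyverse and palmerpenguins packages are installed. The data are grouped and summarized first, then the count is mapped to fill. The geom_text() layer that prints the counts is an optional addition, and the names counts and p are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# one row per observed species/island pair
counts <- penguins %>%
  group_by(species, island) %>%
  summarize(n = n(), .groups = "drop")

# one tile per pair, shaded by count and labeled with the number
p <- counts %>%
  ggplot(aes(x = island, y = species, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white")
p
```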
geom_line() connects point observations
Time series data are the most common application for line graphs. Since line graphs expect a single \(y\) for each \(x\), I recommend using dplyr to generate the table that you need. The examples use n() as the summarizing function, but one could also use sum() of a variable (for example).
# observations over time
penguins %>%
  group_by(year) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_line(aes(x = year, y = n))
# a more interesting example, using tropical storm/hurricane data
?storms
storms %>%
  group_by(year) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_line(aes(x = year, y = n)) +
  labs(y = "Storm observations")
To plot several groups over time, consider geom_line() for comparing them, or geom_area() for showing how they stack into a cumulative total.
# observations over time, by species
penguins %>%
  group_by(year, species) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_line(aes(x = year, y = n, linetype = species))
# storms by hurricane category:
# note: ~5-7 is the greatest number of categories you can color code before readers start having difficulty interpreting!
storms %>%
  mutate(category = as_factor(category)) %>%
  drop_na(category) %>%
  group_by(year, category) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_line(aes(x = year, y = n, color = category)) +
  labs(y = "Storm observations")
# the same data as a stacked area chart:
storms %>%
  mutate(category = as_factor(category)) %>%
  drop_na(category) %>%
  group_by(year, category) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_area(aes(x = year, y = n, fill = category)) +
  labs(y = "Storm observations")
Small multiples, also known as faceting, replicate a plot in columns and/or rows, with one plot for each level of some categorical variable.
Want to combine multiple plots into one figure? Check out the patchwork package.
# one line graph for each storm category
storms %>%
  mutate(category = as_factor(category)) %>%
  drop_na(category) %>%
  group_by(year, category) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_line(aes(x = year, y = n, color = category)) +
  facet_wrap(vars(category), ncol = 3, nrow = 2) +
  labs(y = "Storm observations")
# a scatter for each island, with groups colored
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm,
             color = species)) +
  geom_point() +
  facet_wrap(vars(island), nrow = 2, ncol = 2)
labs() sets labels
penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm,
             color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Gentoo tend to have the longest flippers",
       x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Species",
       caption = "Source: Gorman KB, Williams TD, Fraser WR (2014)")
theme_ layers change the plot’s look
Start typing theme_ in RStudio to see available options, or install and attach the ggthemes package for more. There is also a theme() function for adjusting specific aspects of the plot.
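A minimal sketch of both kinds of theming, assuming the tidyverse and palmerpenguins packages are installed. theme_minimal() is one of the complete themes built into ggplot2, and the theme() call then adjusts a single element; moving the legend to the bottom is an arbitrary choice for illustration.

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(aes(x = flipper_length_mm,
             y = bill_length_mm,
             color = species)) +
  geom_point() +
  # a complete built-in theme changes the overall look
  theme_minimal() +
  # theme() then overrides one specific element
  theme(legend.position = "bottom")
p
```

Put theme() after the complete theme, since a later theme_ layer would overwrite the adjustment.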
Lastly, annotate() can add annotations to the plot, using the coordinate system.
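For instance, a text annotation could look like the sketch below, assuming the tidyverse and palmerpenguins packages are installed. The x and y values are in data units (millimeters here); the position and label text are illustrative choices.

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  # place a single text label at x = 220 mm, y = 35 mm
  annotate("text", x = 220, y = 35,
           label = "Longest flippers: mostly Gentoo")
p
```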
ggsave(filename, plot) where plot is a ggplot() object you have assigned. .png, .svg, or .pdf in the filename determines the output type.
Other arguments to use: width, height, units
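A minimal sketch of saving a plot, assuming the tidyverse and palmerpenguins packages are installed. The file name and the 6 by 4 inch size are arbitrary, and the file is written to a temporary directory here so that the example leaves nothing behind in your project.

```r
library(tidyverse)
library(palmerpenguins)

# assign the plot to a name so it can be passed to ggsave()
p <- penguins %>%
  drop_na(flipper_length_mm, bill_length_mm) %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point()

# the .png extension selects the output type
out_file <- file.path(tempdir(), "penguin_scatter.png")
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```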
Today we learned about:
Join us on 3/21 for inference and modeling!
R 3: Data Visualization